Skip to content

GH-3542 Fix LocalInputFile ByteBuffer read methods to support direct and non-zero-position buffers#3543

Open
arouel wants to merge 2 commits intoapache:masterfrom
arouel:fix-local-input-file-read
Open

GH-3542 Fix LocalInputFile ByteBuffer read methods to support direct and non-zero-position buffers#3543
arouel wants to merge 2 commits intoapache:masterfrom
arouel:fix-local-input-file-read

Conversation

@arouel
Copy link
Copy Markdown
Contributor

@arouel arouel commented May 4, 2026

Rationale for this change

LocalInputFile's read(ByteBuffer) and readFully(ByteBuffer) are broken for two independent reasons:

  1. They pass buf.position() + buf.arrayOffset() as the source offset to ByteBuffer.put(byte[] src, int offset, int length). The source is a freshly-allocated local byte[] whose indices have no relationship to buf.position() or buf.arrayOffset(), this reads from the wrong offset of the source array whenever the destination buffer's position is non-zero.
  2. They call buf.arrayOffset(), which throws UnsupportedOperationException on direct buffers, memory-mapped buffers, and read-only views. ParquetFileReader.readFooter passes exactly such buffer shapes, so any consumer wrapping a Path with new LocalInputFile(path) can hit this bug.
    read(ByteBuffer) has an additional defect: it advances the destination by buf.remaining() regardless of how many bytes read(byte[]) actually returned, corrupting the buffer on short reads and at EOF.

What changes are included in this PR?

  • parquet-common/src/main/java/org/apache/parquet/io/LocalInputFile.java:
    • readFully(ByteBuffer) now uses buf.put(buffer), a straight array-to-buffer copy that works for every ByteBuffer shape and never calls arrayOffset().
    • read(ByteBuffer) copies only the bytes actually read (buf.put(buffer, 0, read) when read > 0) and returns the underlying read result, so -1 at EOF propagates correctly and the destination's position is left untouched on EOF.

Are these changes tested?

Yes. TestLocalInputOutput gains eight regression tests that fail against the previous implementation and pass with the fix.

Are there any user-facing changes?

No API signatures changed.

Closes #3542

arouel added 2 commits May 4, 2026 10:52
Summary
Root cause (parquet-common/src/main/java/org/apache/parquet/io/LocalInputFile.java)
Both read(ByteBuffer) and readFully(ByteBuffer) called:
buf.put(buffer, buf.position() + buf.arrayOffset(), buf.remaining());
Two independent bugs:
1. Wrong put overload semantics. ByteBuffer.put(byte[] src, int offset, int length) reads src[offset..offset+length]. The code passed buf.position() + buf.arrayOffset() as the source offset, but the source is the freshly-allocated buffer array whose indices have no relationship to buf.position() or buf.arrayOffset(). On heap buffers with position=0 and arrayOffset=0 this happens to work; any other state reads from the wrong offset (or throws ArrayIndexOutOfBoundsException).
2. arrayOffset() is not universally defined. Direct buffers, memory-mapped buffers, and read-only views all throw UnsupportedOperationException from arrayOffset(). Any Parquet consumer passing such a buffer (which is exactly what the steampipe integration test triggers via ParquetFileReader.readFooter) blows up at the first call.
3. Partial-read accounting. read(ByteBuffer) unconditionally copied buf.remaining() bytes regardless of how many read(byte[]) actually returned, corrupting the destination buffer on short reads / EOF.
Fix
- readFully(ByteBuffer) → buf.put(buffer) (straight array-to-buffer copy, no arrayOffset()).
- read(ByteBuffer) → only put read bytes when read > 0, return read (including -1 at EOF).
Tests added
TestLocalInputOutput gained 8 regression cases in parquet-common/src/test/java/org/apache/parquet/io/:
- readFullyIntoHeapByteBuffer — baseline happy path.
- readFullyIntoHeapByteBufferWithNonZeroPosition — exercises the old wrong-source-offset bug.
- readFullyIntoDirectByteBuffer — direct buffers; arrayOffset() would have thrown.
- readFullyIntoReadOnlyByteBuffer — read-only views; asserts ReadOnlyBufferException (correct JDK behaviour after the fix).
- readIntoHeapByteBuffer — baseline.
- readIntoByteBufferAdvancesPositionByBytesRead — exercises the partial-read accounting bug.
- readIntoByteBufferReturnsMinusOneAtEof — EOF signalling.
- readIntoDirectByteBuffer — direct buffer read path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix LocalInputFile ByteBuffer read methods to support direct and non-zero-position buffers

1 participant